ANSWERS:
The dataset is too large for a direct visual investigation. We did our analysis with the help of database operations. We loaded the data into an Oracle database and joined the tables with the hospitalization records and the death records into a single table.
We extracted all strings describing the syndromes and analyzed them using simple text processing routines (extraction of unique words, finding combinations including given words) and interactive grouping and classification. In this way, we created a table containing for each symptom the “standard” spelling and possible variants, including abbreviations and synonyms. We excluded the terms denoting injuries, medical operations, and known diseases such as asthma or flu. Then we used a routine that detects occurrences of terms in texts taking into account the possible synonyms. We applied it to the texts extracted from the whole database and from the records corresponding to the deaths and obtained the respective frequencies of the symptoms (Figure 1).
Figure 1. Symptoms, their synonyms, and frequencies in all records and the death records. The rows are ordered according to the frequencies in the death records.
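The term detection with synonyms described above could be sketched roughly as follows. This is only an illustration in Python, not the routine we actually used; the synonym table, the variant spellings, and the function names are assumptions made for the example.

```python
import re

# Hypothetical synonym table: "standard" spelling -> assumed variants.
SYNONYMS = {
    "vomiting": ["vomiting", "vomitting", "emesis"],
    "headache": ["headache", "head ache", "ha"],
}

def build_patterns(synonyms):
    """Compile one case-insensitive, whole-word regex per standard term."""
    patterns = {}
    for term, variants in synonyms.items():
        # Longer variants first, so multi-word spellings are tried before short ones.
        ordered = sorted(variants, key=len, reverse=True)
        alternatives = "|".join(re.escape(v) for v in ordered)
        patterns[term] = re.compile(r"\b(?:" + alternatives + r")\b", re.IGNORECASE)
    return patterns

def detect_terms(text, patterns):
    """Return the set of standard terms that occur in the text under any variant."""
    return {term for term, pattern in patterns.items() if pattern.search(text)}

def term_frequencies(texts, patterns):
    """Count, for each standard term, the number of texts in which it occurs."""
    counts = {term: 0 for term in patterns}
    for text in texts:
        for term in detect_terms(text, patterns):
            counts[term] += 1
    return counts
```

Applying `term_frequencies` once to all texts and once to the texts of the death records would give the two frequency columns of the kind shown in Figure 1.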
There is a large gap between the frequency of “back pox” (2778) and the next highest frequency (397). For the further analysis, we selected all symptoms from the top of the table to “back pox”. We assumed that there might be light cases of the epidemic disease that did not result in deaths. Therefore, we additionally selected the symptoms with high frequencies (over 100,000) among all records that had not been selected before. In total, we selected 30 symptoms.
We created a database query that transformed the texts describing the syndromes into feature vectors. For each selected symptom, the vectors contain 1 or 0 denoting whether the symptom occurs in the texts. The vectors were represented by 30 binary attributes and by strings consisting of the symbols 0 and 1, called “masks”. We obtained 64 different masks; 63 of them represent various combinations of the selected symptoms. One mask, consisting of zeros, represents the syndromes that do not include any of the selected 30 symptoms.
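The construction of the masks can be illustrated with a small Python sketch. We did this step with a database query; the code below only shows the idea, and the short symptom list and function names are hypothetical.

```python
# Hypothetical, shortened list of selected symptoms (we used 30 in the analysis).
SELECTED = ["vomiting", "headache", "tremors"]

def make_mask(detected, selected=SELECTED):
    """Build a string of 0s and 1s: 1 if the symptom was detected in the text."""
    return "".join("1" if symptom in detected else "0" for symptom in selected)

def mask_to_vector(mask):
    """Convert a mask string into a list of binary attributes."""
    return [int(symbol) for symbol in mask]

# A syndrome mentioning none of the selected symptoms gets the zero mask.
zero_mask = make_mask(set())
```

With three symptoms, a text in which only "headache" was detected would get the mask `010`.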
Now we start the analysis using the system’s interface to the Oracle database. We aggregate the data by masks and daily intervals. For each mask, we obtain a time series of hospitalization counts. We visualize them on a time graph display. The series of the zero mask has very high values compared to the rest (Figure 2A). We transform the counts into the differences from the mean values divided by the standard deviations. In the resulting graph, we can easily distinguish the temporal profiles corresponding to the epidemic disease (gradual increase followed by gradual decrease) from the remaining profiles (random fluctuation) (Figure 2B). We use this distinction to classify the syndromes into epidemic and non-epidemic (Figure 2C). When we cancel the transformation (Figure 2D), we see that the time series of the zero mask was classified as non-epidemic. This indicates a high probability that we did not miss any symptom relevant to the pandemic.
Figure 2. Classification of the syndromes into epidemic and non-epidemic according to the temporal variation profiles.
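The transformation of the counts to normalized deviations from the mean can be sketched as below. The system performs it interactively; this Python version is only an illustration, and the choice of the population standard deviation (`pstdev`) is an assumption.

```python
from statistics import mean, pstdev

def normalize(series):
    """Transform counts into (value - mean) / standard deviation."""
    m = mean(series)
    s = pstdev(series)  # assumption: population standard deviation
    if s == 0:
        # A constant series has no variation; all deviations are zero.
        return [0.0] * len(series)
    return [(value - m) / s for value in series]

# Hypothetical daily counts for one mask.
profile = normalize([2, 4, 4, 4, 5, 5, 7, 9])
```

After this transformation, series with very different magnitudes (such as the zero mask and the rare masks) become comparable by shape alone.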
To be absolutely sure, we aggregate only the records with zero masks by syndromes (represented by texts) and daily intervals. We visualize and transform the resulting time series as before. We detect three different temporal patterns: random fluctuations (Figure 3 top), an “epidemic” pattern (Figure 3 center), and a pattern similar to the “epidemic” one but with much noise (Figure 3 bottom). Accordingly, we classify the syndromes into three classes.
Figure 3. Different temporal patterns among the syndromes with zero masks.
The class with the “epidemic” pattern (red) includes 6 syndromes: BACK INJ and BACK INJURY, CONJUNCTIVITIS RED, ABNORMAL LABS, NOSE, and ENCEPHALITIS; the frequencies vary from 37,458 to 37,933 (or 75,695 for BACK INJ together with BACK INJURY). When we selected potentially relevant symptoms, we omitted injuries and known diseases, including conjunctivitis and encephalitis. Also, we considered abnormal labs as semantically irrelevant. The word “nose” alone does not denote a meaningful symptom.
The class with the “noisy epidemic” pattern (blue) includes 23 time series corresponding to different spelling variants of the syndromes vaginal bleeding, pregnant bleeding, pregnancy with vaginal bleeding, vaginal {pain, itching, spotting, discharge, discomfort}, and possible miscarriage. The frequencies vary from 624 to 5,830. We believe that these syndromes are not semantically related to the pandemic (however, a domain expert may have a different opinion).
We conclude that we did not miss any potentially relevant symptoms and syndromes.
Our system is designed primarily for analysis of spatial data with the use of maps. To better utilize its visualization facilities, we created a map of syndromes: we applied Sammon’s projection to the feature vectors and then built Voronoi polygons around the points within the projection. The temporal variation profiles can be visualized on the map of syndromes (Figure 4). The coloring of the polygons corresponds to the type of pattern: epidemic (red) or non-epidemic (green).
Figure 4. The temporal variation profiles on the map of syndromes.
In Figure 5, the pie charts represent the total counts of syndrome occurrences (whole circles) and the respective counts of the deaths (blue sectors). The counts have been obtained by aggregating the data over the whole time span.
Figure 5. The pie charts show the total numbers of syndrome occurrences and the numbers of deaths.
We compute the death rates by dividing the death counts by the total hospitalization counts. The death rates for the epidemic syndromes range from 4.938% (vomiting + headache) to 10.381% (tremors); for the non-epidemic ones – from 0.053% to 0.192%. The total number of hospitalizations with the epidemic syndromes is 3,961,060; the number of deaths is 324,520; the death rate is 8.193%.
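As a check of the arithmetic, the overall death rate for the epidemic syndromes can be recomputed from the two totals given above. The function name is ours and only serves the illustration.

```python
def death_rate_percent(deaths, hospitalizations):
    """Death rate as a percentage of the hospitalization count."""
    return 100.0 * deaths / hospitalizations

# Totals for the epidemic syndromes, taken from the text above.
overall_rate = death_rate_percent(324_520, 3_961_060)
print(f"{overall_rate:.3f}%")  # prints 8.193%
```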
We also compute the statistics of the times that passed from the hospitalizations to the deaths. For the epidemic syndromes, the time was 8 days in almost all cases (about 320,000). The times for the non-epidemic syndromes range from 0 to 9 days and are approximately normally distributed.
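A distribution of the times from hospitalization to death can be obtained, for example, as in the following sketch. The record layout (pairs of admission and death dates) and the sample dates are assumptions made for the illustration.

```python
from collections import Counter
from datetime import date

def delay_distribution(pairs):
    """Count how often each delay (in days) between admission and death occurs."""
    return Counter((death - admitted).days for admitted, death in pairs)

# Hypothetical records: (date of hospitalization, date of death).
sample = [
    (date(2009, 5, 1), date(2009, 5, 9)),
    (date(2009, 5, 2), date(2009, 5, 10)),
    (date(2009, 5, 3), date(2009, 5, 3)),
]
distribution = delay_distribution(sample)
```

Computing such a distribution separately for the epidemic and the non-epidemic syndromes allows the two groups to be compared.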
To determine the time periods of the different stages of the pandemic development, we filter out the non-epidemic syndromes and transpose the time series of the epidemic syndromes into a table with rows corresponding to the dates and columns to the syndromes. We apply the table lens technique to the table display. The periods of initial low values, increase, peak, decrease, and final low values become easily discernible. We use the table display and interactive classification tool to determine the exact time intervals corresponding to these stages (see the video). The approximate periods are (day/month): 16/04-25/04 (before the pandemic), 26/04-13/05 (increase), 14/05-17/05 (peak), 18/05-09/06 (decrease), and 10/06-29/06 (after the pandemic).
Now we aggregate the data by cities analogously to the previous aggregation by syndromes. We aggregate only the records corresponding to the epidemic syndromes. As a result, we obtain for each city a time series of the hospitalization counts and a time series of the death counts. We also obtain the total counts of the hospitalizations and deaths for the whole time span.
The cities differ by their population. For a valid comparison, we divide the counts of the hospitalizations and deaths by the respective population numbers, which have been obtained from Wikipedia.
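The aggregation by cities and the division by the population numbers can be sketched as follows. The city names, record layout, and population figures in the example are placeholders, not values from our data.

```python
from collections import Counter

def counts_by_city(records):
    """Count records per city; each record is a (city, date) pair."""
    return Counter(city for city, _ in records)

def percent_of_population(counts, population):
    """Express each city's count as a percentage of its population."""
    return {city: 100.0 * n / population[city] for city, n in counts.items()}

# Placeholder data for illustration only.
records = [("CityA", "2009-05-01"), ("CityA", "2009-05-02"), ("CityB", "2009-05-01")]
population = {"CityA": 200, "CityB": 50}
relative = percent_of_population(counts_by_city(records), population)
```

In this toy example the smaller city has fewer cases in absolute numbers but a higher relative value, which is the kind of effect the normalization is meant to reveal.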
In Figure 6 top, the temporal variation profiles of the absolute hospitalization counts by cities are shown on a geographic map. The highest numbers are attained in Karachi. In Figure 6 bottom, the relative numbers (hospitalization counts as a percentage of the city population) are represented. The highest relative numbers are attained in Aleppo. The values in Aden, Tolima, and Nairobi are also higher than in Karachi. In Figure 7, a part of the territory is enlarged to reduce overlapping of the symbols. Again, the absolute values are at the top and the relative values at the bottom.
Figure 6. The temporal variation of the epidemic syndrome occurrences by cities. Top: absolute counts, bottom: percentages of the city population numbers.
Figure 7. Enlarged fragments of the maps from Figure 6.
The map suggests that Mersin and Nonthaburi were not affected by the pandemic. We check this by applying the transformation to the normalized deviations from the means, as we did for the syndromes (Figure 8). We see that Nonthaburi and Mersin have random fluctuation patterns, which distinguishes them from the other cities.
Figure 8. The temporal variation profiles with the values transformed to normalized deviations from the means. Top: all cities; bottom: a part of the territory enlarged.
We also compute the death rate for each city as the total death counts divided by the total hospitalization counts. The highest death rates were in Aleppo and Nairobi. The computed statistics can be seen in the table view in Figure 9. The rows are ordered according to the relative numbers of hospitalizations.
Figure 9. The statistics of the hospitalizations, deaths, and death rates by cities.
To establish the timing of the epidemic in each city, we do the following. In the analysis of the syndromes (MC1), we identified the periods 16/04-25/04/2009 and 10/06-29/06/2009 as the periods before and after the pandemic, respectively. We select only the hospitalization counts for the dates within these periods and compute for each city the “standard” average number of hospitalizations and the “standard” variance in the time outside the pandemic (all computations are supported by the interactive tools of the system). Next, we transform the time series to the differences from the “standard” values divided by the “standard” variance. Then, we transpose the transformed time series into a table with the rows corresponding to the dates and columns corresponding to the cities. We use a table display with the table lens technique (Figure 10). The rows are colored according to the earlier identified periods of the overall pandemic development: before the pandemic (orange), increase (red), peak (yellow), decrease (blue), and after the pandemic (cyan).
Figure 10. Transposed time series of the normalized deviations from the “standard” hospitalization counts in the quiet time (before and after the pandemic).
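The normalization against the quiet periods can be illustrated with the sketch below. In our analysis it was done with the interactive tools of the system. The code assumes that the spread measure is the population standard deviation of the quiet-period counts, and the function and parameter names are hypothetical.

```python
from statistics import mean, pstdev

def baseline_deviations(series, quiet_indices):
    """Express each count as its deviation from the quiet-period average,
    measured in units of the quiet-period spread."""
    quiet = [series[i] for i in quiet_indices]
    base_mean = mean(quiet)
    base_spread = pstdev(quiet)  # assumption: population standard deviation
    if base_spread == 0:
        raise ValueError("quiet-period counts do not vary; cannot normalize")
    return [(value - base_mean) / base_spread for value in series]

# Hypothetical daily counts for one city; the first and last four days are "quiet".
counts = [10, 12, 10, 12, 30, 50, 30, 10, 12, 10, 12]
deviations = baseline_deviations(counts, [0, 1, 2, 3, 7, 8, 9, 10])
```

Unlike a normalization over the whole series, the baseline here is not inflated by the epidemic peak itself, so the values during the epidemic stand out clearly.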
To decide what deviation from the “standard” value can be treated as denoting that the epidemic takes place, we look at the maximum deviations in the periods before and after the pandemic (last column in Figure 10). All values but one are lower than 3. The value 3.171 was attained in Nairobi on 24/04/2009, which is quite close to the identified approximate date of the pandemic start (26/04/2009). Hence, we take 3 as the threshold. For each city, we assume that the date of the epidemic start is when the deviation from the “standard” average first exceeds 3 “standard” variances and the date of the epidemic end is the next date after the last deviation higher than 3. The table below gives the dates of the start, peak, and end of the epidemic in each city. The rows are ordered according to the time of the epidemic start.
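The rule for the start and end dates can be expressed as a short Python sketch. It operates on positions in a series of normalized deviations; the function name and the sample values are assumptions made for the illustration.

```python
def epidemic_interval(deviations, threshold=3.0):
    """Return (start, end) positions in the series, or None if the threshold
    is never exceeded. The start is the first position whose deviation exceeds
    the threshold; the end is the position right after the last such deviation."""
    above = [i for i, d in enumerate(deviations) if d > threshold]
    if not above:
        return None
    start = above[0]
    end = above[-1] + 1  # may equal len(deviations) if the series ends above the threshold
    return start, end

# Hypothetical normalized deviations for one city.
interval = epidemic_interval([0.5, 1.2, 3.4, 8.0, 12.5, 6.1, 2.9, 0.3])
```

For a city whose deviations never exceed the threshold, the function returns `None`, which matches the case of cities that show only random fluctuations.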